# Lisa's garden data
data(garden_harvest)
# Seeds/plants (and other garden supply) costs
data("garden_spending")
# Planting dates and locations
data("garden_planting")
# Tidy Tuesday data
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
Before starting your assignment, you need to get yourself set up on GitHub and make sure GitHub is connected to R Studio. To do that, you should read the instruction (through the “Cloning a repo” section) and watch the video here. Then, do the following (if you get stuck on a step, don’t worry, I will help! You can always get started on the homework and we can figure out the GitHub piece later):
keep_md: TRUE in the YAML heading. The .md file is a markdown (NOT R Markdown) file that is an interim step to creating the html file. They are displayed fairly nicely in GitHub, so we want to keep it and look at it there. Click the boxes next to these two files, commit changes (remember to include a commit message), and push them (green up arrow).Put your name at the top of the document.
For ALL graphs, you should include appropriate labels.
Feel free to change the default theme, which I currently have set to theme_minimal().
Use good coding practice. Read the short sections on good code with pipes and ggplot2. This is part of your grade!
When you are finished with ALL the exercises, uncomment the options at the top so your document looks nicer. Don’t do it before then, or else you might miss some important warnings and messages.
These exercises will reiterate what you learned in the “Expanding the data wrangling toolkit” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week (HINT: use the wday() function from devtools::install_github("llendway/gardenR")). Display the results so that the vegetables are rows but the days of the week are columns.garden_harvest %>%
# group by`veg` then by `day` of the week
group_by(vegetable, day = wday(date,label = TRUE, abbr = TRUE)) %>%
# get value of sum weight in gram
summarise(total_weight = sum(weight)) %>%
# pivot the data set based on day value in day column and value from`total_weight`
pivot_wider(names_from = day, values_from=total_weight, values_fill = 0)
garden_harvest data to find the total harvest in pound for each vegetable variety and then try adding the plot from the garden_planting table. This will not turn out perfectly. What is the problem? How might you fix it?There are duplicated observation from garden_harvest (using left_join) that get match to each plot and variety of garden_planting dataset. There is limited information to fix this since the there is no way we can know which plot the the harvested crops are from.
garden_harvest %>%
# group by `veg`
group_by(vegetable) %>%
# find sum weight in pounds (lbs) and round to 4 significant figures
summarise(total_weight = round(sum(weight)*0.00220462,3)) %>%
# join the dataset with `garden_planting`
left_join(garden_planting, by="vegetable")
garden_harvest and garden_spending datasets, along with data from somewhere like this to answer this question. You can answer this in words, referencing various join functions. You don’t need R code but could provide some if it’s helpful.You can compute the total weight of harvested crop group by vegetable like what is done in 1 and use information about the cost from the harvest_spending and the market price of each crop from third party sources like Wholefood to calculate the total amount of expenses from gardening and the total price if you were to buy the same amount of product from the market. You can first create a summarized table detailing the name and total weight of the crops. Then you can use the expense information from harvest_spending and market price information from third party sources to create a new table. Compute the total market price and the expense for each vegetable for the harvest weight. The sum of the difference of the two figurs will be the answer if you save or waste money on gardening.
# create table to list the `variety` based on earliest harvest date
garden_harvest %>%
filter(vegetable=="tomatoes") %>%
select( variety, date, weight) %>%
mutate(variety2 = fct_reorder(variety, date, min, .desc = FALSE)) %>%
group_by(variety2) %>%
summarise(early = min(date))
# plot a barplot but specifying the reordering directly in the aes()
garden_harvest %>%
filter(vegetable=="tomatoes") %>%
select( variety, date, weight) %>%
ggplot()+
geom_col(
aes(x = fct_reorder(variety, date, min, .desc = FALSE),
y = weight,
fill= variety))+
labs(x = "Date",
y = "Weight",
title = "Tomatoes: Harvest Weight by Variety",
subtitle =" The varieties are listed from smallest to largest first harvest date")+
theme(axis.text.x = element_text(angle=90),
legend.position = "none")
garden_harvest data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use str_to_lower(), str_length(), and distinct().garden_harvest %>%
mutate(variety_lower = str_to_lower(variety),
variety_name_length = str_length(variety)) %>%
group_by(variety, variety_name_length) %>%
distinct(variety_lower)
garden_harvest data, find all distinct vegetable varieties that have “er” or “ar” in their name. HINT: str_detect() with an “or” statement (use the | for “or”) and distinct().garden_harvest %>%
mutate(arer = str_detect(variety, "er|ar")) %>%
filter(arer== TRUE) %>%
distinct(variety)
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
{300px}
{300px}
Two data tables are available:
Trips contains records of individual rentalsStations gives the locations of the bike rental stationsHere is the code to read in the data. We do this a little differently than usualy, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.
data_site <-
"https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations<-read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
NOTE: The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you should access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started. Make the following plots and interpret them:
sdate. Use geom_density().Trips %>%
ggplot(aes(x=sdate))+
geom_density()+
labs(title="Trips dy Date", x="Date", y="Trips", caption = "by Vichearith Meas")
mutate() with lubridate’s hour() and minute() functions to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.Trips %>%
select("duration","sdate") %>%
mutate(time_of_day = hour(sdate)+ (round((minute(sdate)/60),2))) %>%
ggplot(aes(x=time_of_day))+
geom_density()+
labs(title="Trips dy Time of Dat", x="Hour of day", y="Trips", caption = "by Vichearith Meas")
Trips %>%
select("duration","sdate") %>%
mutate(day_of_week = wday(sdate, label = TRUE)) %>%
ggplot(aes(y=day_of_week, fill = day_of_week))+
geom_bar()+
theme(legend.position = "none")+
labs(title="The Events versus Day of the week", y="Day of Week", x="Trips", caption = "by Vichearith Meas")
There is a pattern that for weekdays, there is a spike of trip around 8:00 am in the morning and another spike at 5:30 pm in the evening. For weekends, the trend grows slowly generally starting at 5:00 am and hits a peak at around 12:00 pm and 1:00 pm. Then the trend starts to decrease for the rest of the day.
Trips %>%
select("duration","sdate") %>%
mutate(day_of_week = wday(sdate, label = TRUE)) %>%
mutate(time_of_day = hour(sdate)+ (round((minute(sdate)/60),2))) %>%
ggplot(aes(x=time_of_day, color=day_of_week))+
geom_density()+
facet_wrap(~day_of_week)+
theme(legend.position = "none")+
labs(title="Trips by Time of Day", x="Hour of day", y="Trips", caption = "by Vichearith Meas")
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Causal). The next set of exercises investigate whether these two different categories of users show different rental behavior and how client interacts with the patterns you found in the previous exercises.
fill aesthetic for geom_density() to the client variable. You should also set alpha = .5 for transparency and color=NA to suppress the outline of the density function.Trips %>%
select("duration","sdate", "client") %>%
mutate(day_of_week = wday(sdate, label = TRUE)) %>%
mutate(time_of_day = hour(sdate)+ (round((minute(sdate)/60),2))) %>%
ggplot(aes(x=time_of_day, color=day_of_week))+
geom_density(aes(fill = client, alpha=0.5), color=NA)+
facet_wrap(~day_of_week)+
theme(legend.position = "none")+
labs(title="Trips by Time of Day", x="Hour of day", y="Trips", caption = "by Vichearith Meas")
position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?In my opinion, the first graph describe the behaviour of each group throughout the day better. It gives a better understanding about when specific group of renter start their trips and overall characteristic of each group (peak time, how many at what time). By overlaying one over another, the first graph also give a better understand of when the two group overlap on tech other. The second graph gives more understanding about the proportion of the two group at a given time of the day. Both gives a direct comparison that at a given time of the day, which group have more
Trips %>%
select("duration","sdate", "client") %>%
mutate(day_of_week = wday(sdate, label = TRUE)) %>%
mutate(time_of_day = hour(sdate)+ (round((minute(sdate)/60),2))) %>%
ggplot(aes(x=time_of_day, color=day_of_week))+
geom_density(
aes(
fill = client, alpha=0.5),
color=NA,
position=position_stack())+
facet_wrap(~day_of_week)+
theme(legend.position = "none")+
labs(title="Trips by Time of Day", x="Hour of day",
y="Trips",
caption = "by Vichearith Meas")
position = position_stack()). Add a new variable to the dataset called weekend which will be “weekend” if the day is Saturday or Sunday and “weekday” otherwise (HINT: use the ifelse() function and the wday() function from lubridate). Then, update the graph from the previous problem by faceting on the new weekend variable.Trips %>%
select("duration","sdate", "client") %>%
mutate(day_of_week = wday(sdate, label = TRUE)) %>%
mutate(time_of_day = hour(sdate)+ (round((minute(sdate)/60),2))) %>%
mutate(weekend = ifelse(day_of_week %in% c("Sun","Sat"), "weekday", "weekend")) %>%
ggplot(aes(x=time_of_day, color=day_of_week))+
geom_density(
aes(
fill = client, alpha=0.5),
color=NA)+
facet_wrap(~weekend)+
theme(legend.position = "none")+
labs(title="Trips by Day of the Week", x="Hour of day",
y="Trips",
caption = "by Vichearith Meas")
client and fill with weekday. What information does this graph tell you that the previous didn’t? Is one graph better than the other?In previous graph, we can see the renter behavior of two groups in weekend and weekdays. In this graph, we focus on only the difference in behaviour of all renter in weekend and weekdays.
Trips %>%
select("duration","sdate", "client") %>%
mutate(day_of_week = wday(sdate, label = TRUE)) %>%
mutate(time_of_day = hour(sdate)+ (round((minute(sdate)/60),2))) %>%
mutate(weekend = ifelse(day_of_week %in% c("Sun","Sat"), "weekday", "weekend")) %>%
ggplot(aes(x=time_of_day, color=day_of_week))+
geom_density(
aes(
fill = weekend, alpha=0.5),
color=NA)+
facet_wrap(~weekend)+
theme(legend.position = "none")+
labs(title="Trips by Day of the Week", x="Hour of day",
y="Trips",
caption = "by Vichearith Meas")
Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!Stations %>%
rename(sstation = name) %>%
left_join(Trips, by= 'sstation') %>%
ggplot(aes(y = lat, x = long, color= sstation))+
geom_jitter(alpha= 0.25)+
theme(legend.position = "none")+
labs(title="Location of Departure Stations", x="Longtitude",
y="Latitude",
caption = "by Vichearith Meas")
Most caual renters use stations that go no further than 38.95 in latitude. In general, the they use station that is in center of the cluster.
Stations %>%
rename(sstation = name) %>%
left_join(Trips, by= 'sstation') %>%
# Plot only the Causal renter
filter(client == "Casual") %>%
ggplot(aes(y = lat, x = long, color= sstation))+
geom_jitter(alpha= 0.25)+
theme(legend.position = "none")+
labs(title="Location of Departure Stations", x="Longtitude",
y="Latitude",
caption = "by Vichearith Meas")
as_date(sdate) converts sdate from date-time format to date format.ten_stations_date <- Trips %>%
mutate(station_date = paste(sstation, as_date(sdate), sep = " " )) %>%
group_by(station_date) %>%
summarize(number_departure = n()) %>%
arrange(-number_departure) %>%
slice(1:10)
ten_stations_date
Trips %>%
mutate(station_date = paste(sstation, as_date(sdate), sep = " " )) %>%
right_join(ten_stations_date, by = "station_date")
DID YOU REMEMBER TO GO BACK AND CHANGE THIS SET OF EXERCISES TO THE LARGER DATASET? IF NOT, DO THAT NOW.
Trips %>%
mutate(station_date = paste(sstation, as_date(sdate), sep = " " )) %>%
right_join(ten_stations_date, by = "station_date") %>%
mutate(day_of_week = wday(sdate, label = TRUE)) %>%
group_by(client, day_of_week) %>%
summarise (day_count=n()) %>%
mutate(proportion = paste(round(100 * day_count/sum(day_count), 0), "%"))
This problem uses the data from the Tidy Tuesday competition this week, kids. If you need to refresh your memory on the data, read about it here.
facet_geo(). The graphic won’t load below since it came from a location on my computer. So, you’ll have to reference the original html on the moodle page to see it.DID YOU REMEMBER TO UNCOMMENT THE OPTIONS AT THE TOP?